Calculating Euclidean distance

Euclidean distance is simply the geometric distance between two points. Recall the basic Pythagorean formula: \[a^2 + b^2 = c^2\] where \(a\) and \(b\) are the sides of a right triangle and \(c\) is the hypotenuse. Specifically, \(a^2\) is the squared difference between the two points in the first dimension, and \(b^2\) is the squared difference in the second dimension. To get the Euclidean distance \(c\) we take \(\sqrt{c^2}\), or simply \(\sqrt{a^2 + b^2}\).

To express Euclidean distance in multiple dimensions, we can use the same formula and keep adding dimensions, e.g. \(a^2 + b^2 + c^2 = d^2\). It now becomes more convenient to change our notation a bit. Let’s say we have \(n\) dimensions, and let’s call them \(D_1, D_2, \ldots, D_n\). Now for any two points \(\mathbf{x} = (x_1, \ldots , x_n)\) and \(\mathbf{y} = (y_1, \ldots , y_n)\) in \(n\)-dimensional space, we can write the Euclidean distance as:

\[ d_{xy} = \sqrt{\sum\limits_{i=1}^{n}{(y_i - x_i)^2}} \]

Pairwise distances between genes

To find the Euclidean distance between GeneA and GeneB, let’s refer to the value of GeneA in Sample1 as \(A_1\) and the value of GeneA in Sample2 as \(A_2\); let’s call the value of GeneB in Sample1 \(B_1\) and in Sample2 \(B_2\); and so on.

The Euclidean distance between GeneA and GeneB can now be calculated using the following formula:

\[ d_{AB} = \sqrt{ \left( B_1-A_1 \right)^2 + \left( B_2-A_2 \right)^2 + \left( B_3-A_3 \right)^2 + \left( B_4-A_4 \right)^2} = \sqrt{ \sum_{i=1}^{4}{(B_i - A_i)^2 }} \]
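As a quick check, this formula can be computed directly in R; the expression values below are made up for illustration:

```r
# Hypothetical expression values for two genes across four samples
geneA = c(2.1, 3.5, 1.0, 4.2)
geneB = c(1.9, 2.8, 1.5, 3.9)

# Euclidean distance, written out from the formula above
d_AB = sqrt(sum((geneB - geneA)^2))
d_AB

# dist() computes the same distance between the rows of a matrix
dist(rbind(geneA, geneB))
```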

Loading the data

We will subset the data to only include the columns we are interested in. Then we will split the data into test and training sets.

data = read.table("all.15k.patients.txt", 
                  header=TRUE)

data$size = scale(data$size)
data$nodespos = scale(data$nodespos)

specific_data = data[,c("alivestatus","grade",
                        "nodespos","nodesexam",
                        "size","pgr","er")]

specific_data$alivestatus = as.factor(specific_data$alivestatus)

data.test.index = sample(15000, 5000, replace=FALSE)

data.test = specific_data[data.test.index,]
data.train = specific_data[-data.test.index,]
library(ggplot2)
ggplot(specific_data)+geom_point(mapping = aes(x=nodespos, y=size, color=alivestatus))

Clustering

Now that we have calculated the distances between the genes, we can group the genes that are close together. There are two main clustering methods: hierarchical and K-means. Hierarchical clustering is a bottom-up approach where we first group together the genes that are most similar and work our way up. K-means clustering is used when you already know how many clusters you need: \(k\) centers are placed in the data, points are assigned to their nearest center, and the centers are recomputed until they settle.

Hierarchical

Hierarchical clustering can be constructed in many ways, but the simplest is the UPGMA method, which is also used for constructing phylogenetic trees. The steps are:

  • Select the two genes that have the shortest distance. The height of the branch is that distance.
  • Group them as a single entity and recalculate its distance to the other genes. How this is done is called the linkage method. We will use the average method, which means we use the average distance between the two merged elements and the other elements (see example for details).
  • Repeat until all genes are grouped.
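To see the average-linkage recalculation concretely, here is a tiny made-up example with three points:

```r
# Three points; g1 and g2 are closest (distance 1)
d = dist(rbind(g1 = c(0, 0), g2 = c(0, 1), g3 = c(4, 0)))

# g1 and g2 merge first at height 1. Under average linkage, the
# merged group's distance to g3 is the average of d(g1,g3) = 4 and
# d(g2,g3) = sqrt(17), i.e. about 4.06: the second merge height.
hc = hclust(d, method = "average")
hc$height
```

hclust() applies the same recalculation at every merge, whatever the size of the distance matrix.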
spec_hclust = hclust(dist(specific_data[,c("nodespos","size")]), method="average")

plot(spec_hclust)

K-means clustering

For K-means clustering you must first decide on a \(k\). In this case let’s pick 2. This means 2 centers, representing two different groups, will be randomly placed in our dataset, and then points will be assigned to the group whose center they are closest to.

To initialize, a random starting value between the lowest and largest values of each coordinate (here nodespos and size) is chosen for each of the 2 centers; kmeans() handles this for us.

spec_kmeans = kmeans(specific_data[,c("nodespos","size")], 2)

specific_data$kmeans = spec_kmeans$cluster

table(specific_data$alivestatus, specific_data$kmeans)
##    
##         1     2
##   0 11314   931
##   1  1763   992
ggplot(specific_data)+geom_point(mapping = aes(x=nodespos, y=size, color=factor(kmeans), shape=factor(alivestatus)))

Let’s pretend we have a new patient who has the mean value for nodespos and the mean value for size, i.e. (0, 0) since both variables were scaled. How can we determine which cluster it belongs to?

ndata = c(0,0)
test_data = rbind(spec_kmeans$centers,ndata)
test_dist = dist(test_data, diag = T)
test_matrix = as.matrix(test_dist)
test_matrix
##               1        2     ndata
## 1     0.0000000 2.722390 0.3490104
## 2     2.7223897 0.000000 2.3733794
## ndata 0.3490104 2.373379 0.0000000
test_prob = 1-(test_matrix["ndata",1:2]/sum(test_matrix["ndata",1:2]))
test_prob
##      1      2 
## 0.8718 0.1282
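Note that turning distances into a “probability” this way only makes sense with two clusters. If we just want the cluster assignment, we can simply pick the nearest center (hypothetical center coordinates shown here):

```r
# Hypothetical cluster centers (rows) and a new scaled patient at the origin
centers = rbind(c1 = c(0.3, 0.2), c2 = c(2.5, 1.1))
ndata = c(0, 0)

# Euclidean distance from the new point to each center
dists = sqrt(rowSums(sweep(centers, 2, ndata)^2))
dists

# Assign the point to the closest center
which.min(dists)
```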

We can also add another dimension to our clustering, which may well help our predictions.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(data=specific_data,
        x=specific_data$nodespos, 
        y=specific_data$size, 
        z=specific_data$grade,
        type="scatter3d", mode="markers",
        color=specific_data$alivestatus)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

K Nearest Neighbor

The strategy of K-Nearest Neighbor is to simply look at the points that are closest to a test point and assign its classification accordingly. \(k\) stands for the number of closest neighbors to consider. For example, if k=3, then a test point will look at its 3 closest neighbors. If 2 of the 3 are classified as 1 and 1 of the 3 is classified as 0, then the classification of the test point is simply 1.
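The majority vote itself is straightforward; for a hypothetical test point whose 3 nearest neighbors have labels 1, 1 and 0:

```r
# Labels of the k = 3 nearest neighbors of a hypothetical test point
neighbors = c(1, 1, 0)

# Majority vote: the most frequent label wins
votes = table(neighbors)
names(which.max(votes))
```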

So there is no model to fit: we simply keep the known points and use them directly to classify new ones.

You can guess that changing \(k\) will have a large effect on the classification. The larger the \(k\), the smoother the separation between the classes, whereas the smaller the \(k\), the tighter the boundaries, making them more prone to over-fitting.

Ideally you want to try several k values to see which gives the least error.
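That search can be sketched as a simple loop over candidate \(k\) values (shown here on a small simulated dataset, since it only depends on having training and test sets):

```r
library(class)

set.seed(1)
# Simulated 2-D data: two overlapping classes of 100 points each
x = rbind(matrix(rnorm(200, mean = 0),   ncol = 2),
          matrix(rnorm(200, mean = 1.5), ncol = 2))
y = factor(rep(c(0, 1), each = 100))
train = sample(nrow(x), nrow(x)/2)

# Test-set error rate for a range of k values
errors = sapply(c(1, 3, 5, 15, 51), function(k) {
  pred = knn(x[train, ], x[-train, ], y[train], k = k)
  mean(pred != y[-train])
})
errors  # pick the k with the smallest error
```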

Running K-NN

library("class")

data.knn = knn(data.train[,c("nodespos","size")], data.test[,c("nodespos","size")], 
               data.train$alivestatus, k=3)

table(data.knn, data.test$alivestatus)
##         
## data.knn    0    1
##        0 3886  743
##        1  167  204

Calculate the AUC

For the AUC we don’t just need the classifications; we also need the probabilities.

data.knn.prob = knn(data.train[,c("nodespos","size")], data.test[,c("nodespos","size")], 
               data.train$alivestatus, k=3,prob = T)



library(AUC)
## AUC 0.3.0
## Type AUCNews() to see the change log and ?AUC to get an overview.
data.knn.prob.roc = roc(attr(data.knn.prob,"prob"), data.test$alivestatus)
auc(data.knn.prob.roc)
## [1] 0.2793373

This doesn’t work because knn() gives you the proportion of neighbors that voted for the winning class, not the probability of death. To do this correctly we have to take 1 - prob for each prediction of alive (value of 0).

alive_test = which(data.knn.prob == 0)
attr(data.knn.prob, "prob")[alive_test] = 1 - attr(data.knn.prob, "prob")[alive_test]
data.knn.prob.roc = roc(attr(data.knn.prob,"prob"), data.test$alivestatus)
auc(data.knn.prob.roc)
## [1] 0.7558827
ggplot(data.test)+geom_point(mapping = aes(x=nodespos, y=size, color=factor(data.knn.prob), shape=factor(alivestatus)))

Changing the k value

Let’s try \(k = 100\).

data.knn.prob = knn(data.train[,c("nodespos","size")], data.test[,c("nodespos","size")], 
                    data.train$alivestatus, k=100, prob = T)

alive_test = which(data.knn.prob == 0)
attr(data.knn.prob, "prob")[alive_test] = 1 - attr(data.knn.prob, "prob")[alive_test]

data.knn.prob.roc = roc(attr(data.knn.prob,"prob"),
                        data.test$alivestatus)
auc(data.knn.prob.roc)
## [1] 0.7803298
ggplot(data.test)+geom_point(mapping = aes(x=nodespos, y=size, color=factor(data.knn.prob), shape=factor(alivestatus)))